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Abstract: Although discrete mixture modeling has formed the backbone of the literature 
on Bayesian density estimation, there are some well known disadvantages. We propose an 
alternative class of priors based on random nonlinear functions of a uniform latent variable 
with an additive residual. The induced prior for the density is shown to have desirable 
properties including ease of centering on an initial guess for the density, large support, 
posterior consistency and straightforward computation via Gibbs sampling. Some advantages 
over discrete mixtures, such as Dirichlet process mixtures of Gaussian kernels, are discussed 
and illustrated via simulations and an epidemiology application. 
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1. INTRODUCTION 

Nonparametric kernel mixture models are increasingly popular in density estimation and 
high dimensional data modeling. Kernel mixture models have the form: 



where G(-) is a mixing distribution and /C(-) is a probability kernel. The majority of the 
nonparametric Bayesian development in this area relies on Dirichlet process (DP) priors 
(Ferguson, 1973; 1974) for G. These models have been generalized to density regression by 
defining dependence on the covariates x in various ways. Miiller, Elkanli and West (1996) 
used a DP mixture of multivariate normals to jointly model the density of the response and 
predictors to induce a prior on f(y\x). In order to let the parameters of the DP vary over 
the predictor space X, MacEachern (1999) defined dependent Dirichlet processes (DDP) by 
assigning stochastic processes on the components in Sethuraman's (1994) DP representation: 
G x = Y^iPii. x )^dr{x)- De Iorio et al. (2004) proposed a fixed-p DDP, while Griffin and Steel 
(2006) allowed the weights to depend on predictors. Dunson, Pillai and Park (2007) instead 
used predictor-dependent convex combinations of DP components. 

There is also a rich literature on using mixture priors in hierarchical latent variable 
models. Bush and MacEachern (1996) and Kleinman and Ibrahim (1998) proposed DP mix- 
tures on the distributions of random effects. Fokoue and Titterington (2003) and Fokoue 
(2005) proposed mixtures of factor analyzers (MFA) corresponding to a finite mixture of 
multivariate normal kernels with a factor-analytic decomposition of the component-specific 
covariances. Dunson (2006) used dynamic mixtures of DPs to allow a latent variable distri- 
bution to change nonparametrically across groups. More recently, Chen et al. (2009) and 
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Carvalho et al. (2008) proposed nonparametric Bayes MFA allowing an uncertain number 
of factors. Lee, Lu, and Song (2008) placed a truncated DP on the distribution of the la- 
tent variables within a structural equation model (SEM), while Yang and Dunson (2010) 
proposed a centering approach to ensure identifiability of the latent factor distributions. 

The above approaches have relied on discrete mixture models, which have a number of 
well known complications motivating alternative methods for modeling unknown densities, 
such as Polya trees (Mauldin et al, 1992; Lavine, 1992, 1994) and logistic Gaussian processes 
(LGP) (Lenk 1988, 1991; Tokdar 2007). Polya trees have appealing properties in terms of 
denseness, conjugacy and posterior consistency but have disadvantages in terms of favoring 
overly spiky densities. LGP has sound theoretical properties and smoothness of the densities 
can be controlled through the covariance kernel in the GP. However, posterior computation 
is a major hurdle. Recently, Jara and Hanson (2010) proposed dependent tail-free processes 
where they modeled the tail-free probabilities with LGP dependent on covariates. Their 
approach is shown to approximate the Polya tree marginally at each predictor value. An 
alternative was suggested by Tokdar, Zhu and Ghosh (2010) relying on LGP for density 
regression with dimensionality reduction. 

In this article, we focus on a new approach for nonparametric density estimation and 
regression that induces a prior on the unknown density through placing a flexible prior on 
a nonlinear regression function 9 in a latent factor model. The proposed class of models 
is related to Gaussian process latent variable models (GP-LVM) proposed in the machine 
learning literature (Lawrence, 2005; Silva and Gramacy, 2010), but our modeling details are 
different and the focus of this literature has been on nonlinear dimensionality reduction with 
no consideration of density estimation and associated properties. By using GP priors for 6, 
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we obtain substantial control over the smoothness of the induced densities in a very different 
manner than that achieved by LGP-based models. Unlike LGP-based models, the proposed 
model has desirable conjugacy properties facilitating posterior computation. In addition, 
the method has appealing theoretical properties in terms of large support and posterior 
consistency. 

Relative to some density estimation priors, the proposed latent factor approach is quite 
easy to generalize to more challenging settings involving multivariate densities, conditional 
density estimation, hierarchical modeling and other complexities. Although our primary 
focus in this article is to introduce the formulation, providing an intuition for how the model 
works, basic properties and computation, we also give a flavor of generalizations through 
a simple conditional density estimation example. In particular, we consider a model that 
induces a prior on the conditional density f(y\x) through joint modeling of the response and 
predictors through separate nonparametric latent factor models containing the same latent 
variables. This formulation is completely flexible in the marginal densities, while making 
strong restrictions on the dependence to address the curse of dimensionality in a related 
manner to a copula model. An attractive feature of our model is that it naturally allows 
for incorporation of prior information on the marginal densities of response and predictors 
through the mean function of the GP. The utility of incorporating such prior information is 
clear in a reproductive epidemiology application we consider. 

2. DENSITY ESTIMATION 
2.1. Model Specification 

Initially suppose yi are iid draws from an unknown density / G J 7 , where T is the set of 



4 



densities on 3? with respect to Lesbesgue measure. We propose to induce a prior / ~ II 
through 

Hi = fl(Xi) + €i, €i ~ r a , 
\i ~ II*, a ~ v, Xi ~ Uniform (0, 1), (2) 

where \x G is an unknown [0, 1] — > 3? function, Xi is a uniformly distributed latent variable, 
and the error distribution r CT is centered at and has scale parameter a. Hence, in the 
special case in which = /i, so that the regression function is a constant, and IV is 
normal, we have f(y;n,a 2 ) = N(y; fi,a 2 ) so we obtain a normal density. The density of y 
conditionally on the unknown regression function fi and a is obtained on marginalizing out 
the latent variable as 

f{y, V, <r) = ftiAv) = I ? a {y- n{x))dx. (3) 
To complete the specification, we let \x ~ II*, a ~ v and obtain the marginal density 

/*oo p p 1 

f(y)= / T a {y-^x))dxW{d^)v{do). (4) 
Jo Je Jo 

Hence, a prior / ~ n is induced through assigning independent priors to /i and a in expression 
Q. When the prior on /x is a Gaussian process and the error distribution is iV(0, a 2 ) (denoted 
as r CT = (pa), we refer to f^ a as a Gaussian process transfer (GPT) model and the induced 
prior / ~ n as a GPT prior. 

The GPT prior does not have the kernel mixture form Q. There will be no clustering 
of subjects or label switching issues. Instead, the prior / ~ n is induced through adding a 
Gaussian residual to a Gaussian process regression model in a uniform latent variable. This is 
a simple structure aiding computation and interpretability. One can control the smoothness 
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of the density through the covariance in the GP prior for the regression function \i and the size 
of the scale parameter a. In limiting cases, one can obtain realizations of \i concentrated close 
to a flat line, leading to a normal density as a special case. In addition, by making a small and 
choosing the GP covariance to generate a very bumpy ji, one can obtain arbitrarily bumpy 
densities. In practice, by choosing hyperpriors for key covariance parameters, we obtain a 
data adaptive approach that often outperforms discrete kernel mixtures. The performance 
of discrete kernel mixtures relies on the ability to accurately approximate the density with 
few components, and DP mixtures tend to heavily favor a small number of dominate kernels. 
This tendency can sometimes lead to relatively poor estimation, as illustrated in section 6. 

2.2. Prior Specification 

Prior elicitation is an important aspect of Bayesian modeling, with the prior playing a 
particularly important role in Bayesian nonparametric models involving infinitely many pa- 
rameters. Most of the Bayesian nonparametrics literature relies on default priors, which do 
not reflect available prior knowledge in a particular application area, but are chosen to lead 
to good performance in terms of posterior behavior in a wide variety of applications. How- 
ever, as in parametric models, well chosen informative priors that utilize information, such 
as historical data on the variables under study, can substantially improve the performance in 
small to moderate samples. In DP mixtures, such prior information is typically incorporated 
through choice of hyperparameters in the base measure, while maintaining conjugacy for ease 
in computation. For example, in Gaussian kernel mixtures, a normal-inverse gamma base 
measure would be chosen having parameters representing prior knowledge. This is appropri- 
ate when prior knowledge implies that the density follows a t distribution, but when one has 
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prior information that the density follows a more complex form (as in our premature delivery 
application) then elicitation is substantially more difficult. Obtaining a base measure that 
leads to a particular elicited density is a deconvolution problem, which can be difficult to 
solve for non-atomic base measures. In addition, posterior computation under the resulting 
complex and non-conjugate base measure may be challenging. An advantage of the GPT is 
that the prior for the density can be centered on an arbitrary choice easily through the prior 
mean in the GP prior for /i. 

To elaborate, Theorem 3 (section 3) ensures that f^ a ~ f^, a , when ji pa /i* = F" 1 and 
a ~ 0. In terms of application, this translates to incorporating a prior guess f^ a for the 
density through the corresponding mean function /i* = F~ l of the GP, and letting the prior 
for a to have mode near 0. Such a mean function can be constructed by obtaining frequentist 
kernel estimates of the concerned density using some external data, and then converting it 
into an inverse cdf on a grid of points in [0,1] (using a linear approximation). Thus, the 
characteristics of the entire density is captured through F~ x (mean function of the GP) and 
we let the data influence the deviation of the posterior from the prior guess. 

These ideas are demonstrated in Figure 1, where we use some earlier data on gestational 
age at delivery to construct prior densities. We choose a Ga(25,l) prior for the residual 
precision, and different sets of hyper-parameters for the covariance kernel of the GP. The 
frequentist kernel estimates were obtained by the bandwidth selection method of Sheather 
and Jones (1991), using a Gaussian kernel ('kernel' function in R). It is evident that the 
smoothness as well as the degree of deviation of the prior from the frequentist estimate can 
be controlled through the hyper-parameters in the covariance kernel of the GP. 
3. Theoretical Properties 
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To further justify the proposed prior, we show large support and posterior consistency prop- 
erties. Large support is an important property in that it ensures that our prior can generate 
densities that are arbitrarily close to any true density / in a large class, a defining property 
for a nonparametric Bayesian procedure and a necessary condition to allow the posterior 
to concentrate in small neighborhoods of the truth. Instead of focusing narrowly on GPT 
priors, we provide broad theoretical results for priors in the general class of expression (J5]). 

Before proceeding, it is necessary to define some notation and concepts. We denote 
the Kullback-Leibler (KL) divergence of f^ a from f as KL(f^ a , f Q ) and an e— sized KL 
neighborhood around / as KL e (f ). The sup-norm distance is denoted by ||.||oo- Our 
development will rely on the fact that any density /o can be calculated from a "true function" 
/i = Fq 1 , with F denoting the cumulative distribution function. To generate yi ~ f , one 
can equivalently draw 2j ~ Uniform(0, 1) and let t/i = fio(xi). This corresponds to the 
limiting case as a — > in model (J2j) with ft = /i . Hence, we assume a weak regularity 
condition that the true density can be represented as the limiting case 



assuming IV is chosen so that such a limit exists. The above condition is quite reasonable, 
only excluding densities for which convergence in distribution does not imply convergence 
of the corresponding density functions. As additional reasonable regularity conditions, we 
assume that fo is strictly positive and finite, sup x |/io(a;)| < oo and < T a (u) < oo for finite 
u. A uniformly bounded n$ along with the condition on T a implies that for fi belonging to 
a ball of finite radius around fio, f^a is strictly positive and finite for all a e which 
ensures a finite KL divergence for a suitable subset of /i values. 
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Theorem 1. Let sup x \fJ,o(x)\ < oo, T a be normal, Laplace or Cauchy with scale parameter a 
and /o be the corresponding density in T defined as in equation |5]). If n$ is in the sup-norm 
support of IT and v^a : a G (0, 77) j > for all n > 0, then U(KL t (f )) > for all e > 0. 

Theorem 1 allows us to verify that the induced prior on the density / assigns positive 
probability to KL neighborhoods of any strictly positive and finite true density /q. From 
Schwartz (1965), if the true density /o is in the KL support of the prior for /, the posterior 
distribution for / will concentrate asymptotically in arbitrarily small weak neighborhoods of 
fo- Theorem 1 requires the prior \x ~ II* to place positive probability in sup-norm neighbor- 
hoods of the inverse cdfF _1 . Although one can verify this condition for certain choices of 
IT, such as appropriately chosen Gaussian process priors, it is nonetheless somewhat strin- 
gent. We show in Theorem 2 that this condition can be relaxed to only require that the 
prior fi ~ IT assigns positive probability to L-l neighborhoods of any element /io of O. It 
is well known that positive sup-norm support automatically guarantees positive L-l support 
but the converse is not true. 

Theorem 2. Let sup x \[J>o(x)\ < oo, T a = <p a and fo be the corresponding density in J 7 defined 
in equation (5 ). If jiQ is in the L-l support of II* and u^a : a £ (0, tj)\ > for all r/ > 0, then 
U(KL e (f )) > for all e > 0. 

As the prior / ~ II is specified indirectly through priors // ~ IT* and a ~ v, it is desirable 
for elicitation purposes to verify that, for sufficiently small a, fi ~ (1q implies that f^ ~ Jo- 
Theorem 3 provides such a verification assuming Gaussian errors. This implies one can 
potentially center the prior for the density / on an initial parametric guess / by centering 
fi ~ II* on the inverse cdf F~ l while choosing the prior for a to have mode near zero. The 
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data will then inform about the degree to which fi deviates from F 1 and a deviates from 0. 

Theorem 3. For fio G and T a = <p a , let fo be the density resulting from equation Then 
for n G N ei (/io), with iV ei (/i ) an e x -sized L-l neighborhood around /i , and a G (e 2 , e 2 ), we 
have fn,a £ An (/ ) for arbitrarily small €±,62, e* 2 such that < E\ < e 2 < t* 2 . 

Although Theorems 1-2 lead to weak posterior consistency, small weak neighborhoods 
around f are topologically too large and may include densities that are quite different 
from fo in shape and other characteristics. Hence, it is appealing to establish a strong 
posterior consistency result in which the posterior probability allocated to arbitrarily small 
L-l neighborhoods of fo increases towards one exponentially fast with increasing sample size. 
Focusing on the GPT prior described above, we show in Theorem 4 that strong posterior 
consistency holds under some conditions on the prior. Notably, for an appropriately chosen 
GPT prior, we obtain L-l posterior consistency for all strictly positive and finite true densities 
fo G T with the weak regularity condition (5). 

Theorem 4. Suppose /x ~ GP(m,c) and define fo as in equation |5p where T a = Let 
U={fn t a : / \fn,o- — fa\dy < e, \i G 6,cr G (0,oo)}. Suppose the mean function m(-) is con- 
tinuously differentiable, and the covariance function c(-, ■) has continuous fourth derivatives. 
Further < vio : a G (0,i]i)) < 772 for any arbitrarily small r\\, with 772 decreasing with r)\. 
Then, fo is in the KL support of TI implies that the posterior is strongly consistent at fo- 

The above assumptions on the GP can be verified for many popular covariance functions, 
both stationary and nonstationary. Some such examples can be found in Choi et. al. (2004). 

4. SINGLE FACTOR DENSITY REGRESSION 

As a simple and parsimonious single factor model that generalizes the model of Section 2 to 
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include predictors Z{ = (zn, . . . , z ip )' of a response y*, we let 

2/i = /i y (xi) + ej, e, ~ N(0,a Y ), 

z ik = v Zk {xi) + e* h , 4~ N{0,a 2 Zk ),k = l,2,...,p, 

Xi ~ Uniform(0, 1), 

/i y ~ n y , /i Zfc ~n Zfc ,A; = l,2,...,p, 



(6) 



where /z y ,/i Zfc G are unknown [0,1] — >■ 3? functions, e's are independent errors and Xj 
is the latent variable. For simplicity, we assume the same prior v on the precision of the 
measurement errors in each component model, though this assumption is trivial to relax. 
Expression ^ is a multivariate generalization of the univariate density estimation model (J5]). 
Marginally each of the variables is assigned exactly the prior in ^ and to allow dependence 
we incorporate the same latent factor X{ in each of the models. 

Our goal in defining a joint model is to induce a flexible but parsimonious model for 
the conditional density of jji given the predictors Zj. In estimating conditional densities for 
multiple predictors, one encounters a daunting dimensionality problem in that one is at- 
tempting to estimate a density nonparametrically while allowing arbitrary changes in this 
density across a multivariate predictor space. Clearly, as p increases even for large sam- 
ples there will be many regions of the predictor space that have sparse observations. As 
a compromise between flexibility and parsimony in addressing the curse of dimensionality, 
we propose to use a single factor model in which the marginals for each variable are fully 
flexible but restrictions come in through assuming dependence on a single Xj. Extensions to 
the multiple factor case are straightforward. 
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5. POSTERIOR COMPUTATION 

For simplicity, we focus on the single predictor density regression case when outlining an 
MCMC algorithm for posterior computation. Let Y nxl and Z Nx i denote the vector of 
observations and covariates, respectively. We are interested in prediction of y n+ i, . . . ,i/n 
based on z n+ i, . . . , zn- Let /i Y (n x 1) and ji^ (N x 1) denote the realizations of the GP fi Y 
and \x z at the latent variable values x = (x±, . . . , x n , x n+1 , xn)'. From the GP prior, we have 
/iy ~ N„(my, K y ) and fj,^ ~ ^(m^, Kf). The covariance kernels are squared exponential 
with K Y (x,x') = ^exp | — Cy{x — a;') 2 j and K z (x,x') = exp | — Cz(x — x') 2 1. We 
specify conjugate gamma priors: a Y 2 ~ Ga(a a ,b a ), a^ 2 ~ Ga( y aa a .,bb a ), 0y ~ Gai^a^^b^) 
and 4>z ~ Ga^aa^, bb^). For updating the latent variables x, we adopt the griddy Gibbs 
approach using a set of evenly distributed grid points <?*, g^, ■ ■ . , g G G (0,1). Let Dy and D 2 
be diagonal matrices having a\ and <r| as their diagonal elements respectively. Let n Y {— i) 
include all elements of fi Y except fi Y (xi), and similarly for fi^(-i). The Gibbs sampling 
algorithm alternates between the following steps. 

Stepl: Update a Y and <r| using -k(o y 2 \ — ) ~Ga(ao-+n/2, b CT + \ Y^i=i(Vi ~ ^ y ( x «)) 2 ) an d 

^(o'z 2 ! - ) ~Ga(aa CT +N/2, bb CT + \ Yn =1 (zi - fi z {xi)) 2 ) respectively. 

Step2: To sample the latent variables, choose = g\ with probability p ik , where 



= Plx p^N^jx, = gDlM-tyNQiZjxi = glM(-i)) . fi< n 

pf fe ^(/^ = ^)|/4H)) 



, if n < i < N, 

where p£ = N( yi ; /j, Y ( Xi = g* k ), a Y ), pf k = N( Zi ; fi z ( Xi = g* k ),a 2 z ) and k=l,2, ...,G. 
Step3: Update /4 and /if using ?r(^|-) = iV„((D y 1 + (Ky)- 1 )- 1 (D y 1 F+(Ky)- 1 my), (D y x + 
(Ky)- 1 )- 1 ) and vr(^l-) = i%((D^ + (Kf )- 1 )- 1 (D^ 1 Z + (Kf )~ 1 mf ), (D, 1 + (Kf )- 1 )- 1 ) 
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respectively. 

Step4: Update fif = {fi Y (g$, . . . , pt Y (g* G )} and fif = {n z {gl), . . . , pL Z (g* G )} using the 

conditional normal distributions N(/iy G |/z y ) and N(/i^ G |/i^) respectively. 

StepJ: Update (fi Y and Z using ir(<f> Y \—) ~Ga(a^ + |, + |(/x y - m y )'(K y ) _1 (/x y - m Y ) 

) and 7t(0 z |-) ~Ga(aa^ + f , bb^ + - )'(Kf - mf ) ) respectively. 

Step6: Update C Y and C z using Metropolis random walk for log(C y ) and log(C^). 

For prediction of yu based on Zk, k = n + 1, . . . , N, we use 7r(yfe|— ) = N(yk] fi Y (xk),cr Y ), while 



the conditional density estimate is calculated as f(y\z) 



^ESr=i^z(*-M z (fliE)) 
6. SIMULATION STUDY 

To assess the performance of the GPT approach in density estimation as well as density 
regression, we conducted several simulation studies. We chose the mean function for the 
GP as m(x)=2sin(x)+cos(x) and utilized the squared exponential covariance kernel. For 
computational purposes, we worked with the standardized data and then transformed it 
back in the final step. The hyperparameters for the gamma priors were chosen to be one 
throughout. Although we used 75 grid points for the griddy Gibbs approach, the number of 
points could be as low as 60. The number of iterations used was 10000 with a burn in of 
1000. The convergence for the main quantities such as /i was rapid with good mixing. All 
results are reported over 5 replicates. 
6.1. Univariate Density Estimation 

To see how well the GPT does in practice for density estimation, we looked at a variety of 
scenarios, where the truth was generated from the densities considered in Marron and Wand 
(1992), which are essentially finite mixtures of Gaussians. We present the results from four 
of those cases which we thought to be interesting deviations from normality and could be 
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potentially encountered in applications. These are the 2nd, 6th, 8th and 9th Marron-Wand 
densities. The sample size used was 100. For comparison, we looked at DP mixture of 
Gaussians (Escobar and West, 1995), mixtures of Polya trees (Hanson, 2006) and frequentist 
kernel estimates using a Gaussian kernel (and the bandwidth selection method of Sheather 
and Jones, 1991). More specifically, for both DP mixtures and mixtures of Polya trees, 
we used the DP package in R and the standard hyperparameter values therein. We used 
algorithm 8 of Neal (2000) with m=l for DP mixtures of Gaussians. For frequentist kernel, 
we used the function "density" in R with Gaussian kernel. Overall, we found that varying 
the hyperparameter values within a reasonable range does not significantly alter the density 
estimation results for a sample size of 100, for any of the competitors. Table 1 presents the 
L-l distance between true and estimated densities while Figure 2 depicts the density plots. 
Table 1: Marron-Wand Curves: L-l distance between true and estimated densities 



Method 




L-l Distance 








MW 2 


MW 6 


MW 8 


MW 9 


GPT 


0.031 


0.035 


0.031 


0.028 


DPM 


0.035 


0.036 


0.03 


0.038 


Polya tree mixture 


0.065 


0.036 


0.045 


0.042 


Frequentist Kernel 


0.145 


0.031 


0.033 


0.028 



From table 1, we see that even when the truth is generated from a finite mixture of Gaussians, 
the GPT tends to do better or at least as well as the DP mixture of Gaussians. Mixtures of 
Polya trees have somewhat worse performance and result in overly spiky looking estimates. 
6.2. Single Factor Density Regression 

For density regression, we generated a univariate response by allowing the conditional mean 
as well as the residual error distribution to vary with the covariate. We compared the out of 
sample predictive performance of GPT with other competitors such as DP mixture of bivari- 
ate normals (Miiller, Erkanli and West, 1996), Bayesian additive regression trees (BART) 
(Chipman, George and McCulloch, 2010), GP mean regression (O'Hagan and Kingman, 
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1978) and treed GP (Gramacy and Lee, 2008), based on standard packages in R. We used 
the DP package for DP mixtures of Gaussians and the Bayestree package for the other three 
methods, and the hyperparameter values therein. The density regression results did not 
change significantly on varying the hyperparameter values within a reasonable range, for all 
the competitors. We used the following scheme for simulations: 

1 + e 2 * j 1 + e z » 

where Fz is the distribution of the predictors which was chosen to be a trimodal density (9th 
Marron-Wand curve). We chose A = 3 and split the total sample size of 100 into training set 
of 50 and test set of 50. The above data generating model allows the shape of the conditional 
density to change with predictors, hence making prediction non-trivial. Table 2 shows the 
performance of the GPT along with a few competitors. We computed the mean square error 
(MSE), 95% coverage for the mean (COV), as well as the L-l distance between true and 
estimated densities at 25th, 50th and 75th percentiles of the predictor distribution. 
Table 2: Out of Sample MSE and L-l distance between true and estimated densities 



Method 


MSE 


COV(%) 




L-l Distance 










25th 


50th 


75th 


GPT 


1.26 


94 


0.08 


0.04 


0.06 


DPM 


1.53 


42 


0.03 


0.04 


0.06 


BART 


1.59 


46 


0.13 


0.026 


0.10 


GP reg 


1.52 


76 


0.14 


0.03 


0.10 


treed GP 


1.6 


72 


0.07 


0.09 


0.04 



The results in table 2 are consistent with our experience in simulations- when the predictor 
distribution is multimodal and the shape of the conditional density is allowed to change with 
predictors, then the GPT tends to do as well or better than DP mixture of Gaussians. For 
the above study, the average number of components in the conditional distribution obtained 
from DP mixtures was around 15 which is quite high for a sample size of 50. As illustrated in 
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table 2, BART, treed GP and the GP mean regression methods are primarily mean regression 
methods and so cannot possibly do well in terms of characterizing the entire conditional of 
response given predictors. They might perhaps estimate the mean surface reasonably well, 
but eventually fail in capturing multimodality or tail behavior, the latter often being an 
important focus of inferences. 

7. EPIDEMIOLOGY APPLICATION 

7.1 Study Background 

DDT is a cheap and popular alternative for reducing the transmission of malaria, but has 
been shown to have negative effects on public health. In order to study the association 
between the DDT metabolite DDE and preterm delivery, Longnecker et al. (2001) measured 
DDE in mother's serum in the third trimester of pregnancy and also recorded gestational age 
at delivery (GAD) as well as age. They did logistic regression with response as dichotomized 
GAD (preterm or normal depending on a cut-off of 37 weeks of completed gestation) and ex- 
planatory variables as categorized DDE based on empirical quantiles. Their results showed a 
significant dose-response relationship which had important public health implications. Dun- 
son et al. (2008) analyzed the data using kernel stick-breaking processes, and showed an 
increasing bump in the left tail of the GAD density with increasing DDE. 

7.2 Analysis and Results 

We used the GPT to analyze the dose response relationship in a subset of 182 women of 
advanced maternal age (> 35 yrs) in the above dataset. We examined the conditional 
distribution of GAD at 10th, 60th, 90th and 99th percentile of DDE. Further, we looked at 
the dose response relationship between preterm birth and DDE, by examining the left tail 
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of GAD over varying doses of DDE. We used normalized data for analysis and converted 
it back in the final step. Using the prior specification approach of section 2.2, we were 
able to incorporate prior information on the marginal density of GAD (using an external 
data) through the mean function of the GP. Note that prior on a~ 2 for GAD was chosen 
as Ga(25,l). Given the limited sample size and the complexity of the data we are trying to 
model, we adjusted other hyperparameter settings to reflect our prior belief about the data. 
The starting value for the length-scale parameter in the covariance kernel in the Metropolis 
random walk was chosen to be 25, so as to have smooth Gaussian process prior. Instead of 
working with DDE, we used log(DDE) which resembled a Gaussian distribution, with a 
mean function for the predictor component and Ga(l,l) prior for the corresponding residual 
precision. 

Figure 3 shows the conditional distribution curves for GPT along with 90% credible in- 
tervals. Although we focused on a small subsample of 182 women of advanced maternal age, 
the GPT results for the conditional density are remarkably similar to the ones reported in 
Dunson et al. (2008), which suggests that there is no systematic difference for women of 
advanced maternal age. The conditional densities show an increasing bump in the left tail 
with increasing DDE, suggesting increased risk of preterm birth at higher doses. This is 
further supported by dose- response curves for P(GAD<T) in Figure 5, with different choices 
for cut-off T. Although the dose-response curve is mostly flat for T=33 weeks, the rela- 
tionship becomes more significant as cut-off increases, with the dose-response tapering off 
at T=40 weeks. This suggests that increased risk of preterm birth at higher DDE dosage 
is attributable to premature deliveries between 33 and 37 weeks. Trace plots of i(y\z) for 
different DDE percentiles (not shown) exhibit excellent rates of convergence and mixing. 
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For comparison, Figure 4 shows the density estimates from the DP mixture of Gaussians 
which has a tendency to overly favor multimodal densities, which is as expected given our 
simulation study results. These results were obtained using DP package in R (and the data 
driven hyperparameter values therein), which utilizes algorithm 8 of Neal (2000) with m=l. 

8. Discussion 

In this paper, we propose a latent factor model for density estimation. This novel method pro- 
vides us with a flexible non-discrete mixture alternative to be used in a variety of situations 
including density estimation, density regression, hierarchical latent variable models and even 
mixed models. We provide theoretical theoretical justifications for GPT and demonstrate 
it's usefulness as a building block for more complex models involving covariates. Building 
on our work, Pati, Bhattacharya and Dunson (to be submitted) recently showed minimax 
optimal rates of posterior contraction for Bayesian density estimation from non-linear la- 
tent variable models, also obtaining initial results on contraction rates in conditional density 
estimation. The close relationship between non-linear latent variable models for densities 
and non-linear mean regression models facilitates not only posterior computation but also 
derivations of theoretical properties, such as contraction rates, which have proven difficult 
to study for discrete mixtures beyond simple settings. 
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APPENDIX: PROOF OF RESULTS 
Proof of Theorem 1: 

That f M)CT exists is seen from equation (|3]). The limiting case as a — > can be explained 
as the case when Y has the distribution function F= /i -1 and marginalizing out the latent 
variable xG [0,1]. Thus lim^o J T a (y — [i(x))dx exists. Under regularity conditions, we 
can write the KL divergence between f^ and fo as 

lim^o fo F a (y - fi (x))dx 



KL(f^,f ) = // O log A= f fo (y) 



dy. (7) 



Note, r CT (y - fi(x)) > J^h-MV where = e ^ ( ^° )2 -^° )( ^ o) for nor- 

mal error, while for Laplace error, h a (y,fi — /x ) = e ^ . Further sup x \ \og(h a (y,fi(x) — 
fjLo(x)))\ — > for all y £ 5ft as ||/x — ^o||oo 5 go to with — /io||oo/<7 2 — > 0. Under regularity 
conditions, -fa- has a finite upper bound, hence we can use dominated convergence theorem sub- 
sequently. Observe, for a fixed a, J* T a {y - n(x))dx > suPxe[01)K ^ {x) _^ {x)) Jo T Av ~ Po(z))dx. 
As ||/i — Ho\\oo, c 2 go to with \\fi — ^o||oo/c 2 — > 0, and applying dominated convergence theorem, 
< 4/0(2/) log 

UAy)^ — 1™°"^° lim ||M— moIIoo— s-o ^og{sup x h a (y, fi(x) fio(x))) — >■ 0. For Cauchy 
errors, f (y) = lim^o Jq ^7 —dx and 



UAv) = Jo h-/ \ dx = Jo h- ( \ dx - 

I 1+ (v-rf J l l+4 r [(»- W) (*)) a -2(»-Aio(a!))(M(*)-W)(*))+(A»(ai)-A«(ai)) 2 ] J 

As \\fjL — )Uo||ooj °" 2 go to with — /iolloo/o" 2 — ► 0, and applying dominated convergence 
theorem, J^ fo(y) log j^^ dy — > 0. Thus taking appropriate limits, KL(/ Aicr , / ) goes to 
under Gaussian, Laplace and Cauchy errors. Hence we can choose a suitably small e\ and 
e 2 with < ei < e 2 such that — Mo||oc < £i,0 < cr < e 2 j => KL{f^ a , fo) < e. Using the 
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assumptions on the support of priors II* and v, we have U(KL e (fo)) > 0. 
Proof of Theorem 2: 

Using the regularity conditions and Taylor's series expansion, we have for fixed y, fx, a, 

, /o(y) sr ( {(fo(y) - i) k - (UAv) - V k } , , y( , , V( w , , 

log- = 2^( _1 ) E ^ + (n ) - df (n ) , for a fixed n , 

where S\ (no)— 5% (no) is uniformly bounded in y. Using the identity a ri -b ri =(a-b)(^^ =1 a n ~ k b k ~ 1 ) 
and denoting g = f - 1 and ^ = / M;(7 - 1, we have, 

,{(/o-l) fc -(/^-l) fc }„ ^ /".(-I)* 



fc=i fe=i ^ «=i 

^ sup |L_^(^^-i)| / |/ (y)-/ wr (y)|^ = lf(no) / |/o(y) - f^Wv, 
i. i y i i J J 



k=l k=l 

no , 1 % y. k 

< 

k=l v 2=1 

where K(no)=Xwb=i su Py \^~^~(Y^i=i d^Jdb' 1 )] i s a finite constant depending on n , for fx 



belonging to a finite L-l ball around /io, using the regularity conditions. Further, using 
similar methods as in the proof of theorem 3, we can show that for e\ < e 2 < e* 2 , 

{fx G N ei (f, ),ae (e 2 ,e* 2 )} [ \f (y) - U, a {v)\dy < -. (8) 

J £2 



Using inequality ^8j), we have for fx G N ei (fx ) and a G (€2,62), 
/ I E (-I)' {(/ ° " 1)fc ~ ^ ~ \dy < K(n )^. Also note, KL(f , /„,„) < // |logA| 

= //olE (-!)" {(/ °~ 1)fc ~ (/M " J ~ 1)fc} + *f (n ) - .5? (n ) | < K(n )^ + A(n ) = c, 
^ fc=i 62 

for a finite K(no) and suitably small A (no) = sup y \5f(n ) — <5f (n )| with e\, e 2 depending on 
e. Under positive L-l support by IT, the rest follows by similar arguments as in theorem 1. 

Proof of Theorem 3: 

Note that, \ fu,a(y) - fo(y)\ < \sup xe ( 0> i)<f) a (y - fi(x)) - / |, which is integrable V(u,cr). 
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Then using dominated convergence theorem, lim^o f \f^ a — fo\ = |liin ff ->o \fn,a~ fo\ ■ Now 
applying Fatou's Lemma and Fubini's Theorem successively, we have, 

lim \fw - f \dy < / lim / \4> a {y - fi(x)) - <j> a (y - fio(x))\6x dy 

(J— >0 I cr— »U 







< 



liminf / / ItpJy — u(x)) — (j)Jy — fj, (x))\dx dy (Fatou's lemma) 
J Jo 

liminf / \4>a(y — M x )) ~ ^a-iy — Uo(x))\dy dx (Fubini's Theorem) 
Jo J 



= liminf ^ / / \(f)a(y - fi(x)) - (p a (y - /x (x))|dy dx 
+ / / \<t>a(y ~ Vo{x)) - <t>*(y - v{x))\dy dx 

In the proof of lemma 1 of Ghosal, Ghosh and Ramamoorthy (1999), it was shown that for 
fixed 9 X < 9 2 , U a {y - d{) - - 02)|| < h ^ x , which would imply 

lim \f„ - f \dy < liminf// ^'^ dx+f ^l^°M dx 

= liminf \m^Mi dx < Hm ^^M dx . (9) 



""^o jo °" CT ^° Jo G 

As II a 2 go to with | |/x - /Uo||oo/o- 2 -> 0, the above limit exists and goes to 0. Given 

that the limit exists and goes to 0, we can now choose sufficiently small ex, €2,^2 w hh < e± < 
62 < e* 2 such that for \fio(x) — /J,(x)\dx < e\ = y, E N ei (no) and a G (£2,62), we would have 
I\fw-h\dy< f. using (9) . 



Proof of Theorem 4: 

Our proof is based on theorem 2 of Ghosal, Ghosh and Ramamoorthi (1999) who gave a 
set of alternate sufficient conditions for almost sure convergence of the posterior of strong 
neighborhoods. Their result involves conditions on the size of the parameter space in terms 
of L-l metric entropy. Before proceeding, let us review L-l metric entropy and theorem 2 of 
Ghosal, Ghosh and Ramamoorthi (1999). 
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DEFINITION 1. For Q C T and 5 > 0, L-l metric entropy J(S,Q) is defined as the 
minimum of log(fc : Q C U* =1 {/ : / |/ - f { \dy < 5, f u f 2 , . . . , f k G J 7 }). 

Theorem 5. (Ghosal, Ghosh and Ramamoorthi) Let IT be a prior on T . Suppose E J 7 is 

in the Kullback-Leibler support of II and let U={f : J \ f — fo\dy < e}. If there is a 5 < e/4 7 

Ci, c 2 > 0, /3 < e 2 /8 and J n C J snc/j t/iat /or a// Zarge n: 

(%) n(J^) < ciexp(— nc 2 ), and, 

^ T/ie L-l metric entropy, J(5,J r n ) < n(3, 

ihenn(U\Y 1 ,Y 2 ,...,Y n )-H a.s. P fo . 

The constants 5, c±, c 2 , (3 and J-" n are allowed to depend on e. 

Let the parameter space for (//, a) be denoted as "H. Consider the subsets of the parameter 
space U n = U ln ® H 2n , where H ln = {/i : \\fJ,\\oo < ^n, H^'Hoc < M n } and "H 2n = [^n, oo), 
with L n ->■ such that i/(cr G (0, L n )) < diexp(-d 2 n), di,d 2 > and M n = 0(n 1/2 ). The 
regularity conditions on the GP guarantees existence of the first derivative // with probability 
1. Using lemma 4 of Choi and Schervish (2004) who showed an upper bound on sup-norm 
metric entropy of "Hi„, we have the upper bound on LI metric entropy as 

J(5,H ln ) < K x M n /8. (10) 

This implies there are K* = exp(K 1 M n /5) elements /ii,/i 2 , . . . , \ik* such that 

n ln C Uf = ; j/j : jf |/i - /i.lda; < <j| . (11) 

Let us consider the sieve T n = G J 7 : (//, cr) G "H„}. Further, let us consider densities 
fj,n — //ii,L„ e 3~ n defined as in section 2, where the /ij, i = 1, . . . , X* correspond to the ones 
just defined to cover T-Li n . Using similar techniques as in lemma 1 of Ghosal, Ghosh and 
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Ramamoorthi (1999) and theorem 3, it can be shown that for / Mj(T e J^, 

/ \U„ - kn\dy < f l -^=^dx + f l -^^dx < f 1 \n - n,\dx < f .(12) 

J Jo V27TCT Jo V27rL„ V nL n Jo 

This clearly implies 

J(5/L n ,F n ) = J&HijK^Mn/S^J&FjK^MnLJSKnP, (13) 

where we can choose 5 < e/4 such that (5 < e 2 /8. Thus the second condition in theorem 6 
is satisfied. Also note that the prior probability of T n can be calculated in terms of IT* and 
v. Using the regularity conditions on the GP and z/, and similar reasoning as in lemma 5 of 
Choi and Schervish (2004), 

n(^) = (IT ® v){W n ) < c 1 exp(-c 2 n),c 1 ,c 2 > 0. (14) 
Thus the first condition in theorem 6 is satisfied. 
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Figure 1: Prior realizations from the GPT for gestational age at delivery (solid lines) along with 
frequentist kernel density estimate (dotted lines). The rows correspond to </>i=(0.01, 0.1); the 
columns correspond to c/»2=(0.1, 1,25,100). 
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(a) Marron Wand 2nd curve 



(b) Marron Wand 6th curve 





ure 2: Marron- Wand curves - density estimates for GPT, DPM and Polya tree mixtures. 
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10th percentile of DDE (12.57) 



60th percentile of DDE (28.44) 




Figure 3: GPT conditional density estimates and 90% credible intervals for 10th, 60th, 90th, 99th 
DDE quantiles. Vertical dashed line for cut-off at 37 weeks. 
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10th percentile of DDE (12.57) 



60th percentile of DDE (28.44) 




Figure 4: DPM conditional density estimates and 90% credible intervals for 10th, 60th, 90th, 99th 
DDE quantiles. Vertical dashed line for cut-off at 37 weeks. 
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Figure 5: Estimated probability that gestational age at delivery is less than T weeks versus DDE 
dose, for (a) T = 33, (b) T = 35, (c) T = 37, (d) T = 40. Solid lines are posterior means and 
dashed lines are pointwise 90% credible intervals. 
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